library(readr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(Ecdat)
## Loading required package: Ecfun
## 
## Attaching package: 'Ecfun'
## 
## The following object is masked from 'package:base':
## 
##     sign
## 
## 
## Attaching package: 'Ecdat'
## 
## The following object is masked from 'package:datasets':
## 
##     Orange
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(FNN)
library(ggrepel)
library(e1071)
library(gains)
library(rpart)
library(rpart.plot)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
library(ROCR)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:randomForest':
## 
##     combine
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## 
## The following object is masked from 'package:e1071':
## 
##     impute
## 
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## 
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## 
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## 
## The following object is masked from 'package:base':
## 
##     Recall
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Step I: Data Preparation & Exploration (15 points)

Read your data into your local environment, and subset/filter the data so that you are dealing only with the records that pertain to your team’s city.

## Set your directory. Select your own folder to load the data. 
setwd("~/Library/CloudStorage/OneDrive-Personal/Boston University/MET AD699 Data Mining for Business Analytics/Final project")
train_csv <- read_csv("train.csv.csv")
## Rows: 74111 Columns: 29
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (12): property_type, room_type, amenities, bed_type, cancellation_polic...
## dbl  (10): id, log_price, accommodates, bathrooms, latitude, longitude, numb...
## lgl   (4): cleaning_fee, host_has_profile_pic, host_identity_verified, insta...
## date  (3): first_review, host_since, last_review
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Re-read with base read.csv() so downstream code works with a plain data.frame,
# then filter for NYC
df <- read.csv("train.csv.csv")
nyc_data <- df %>%
  filter(city == "NYC")

I. Missing Values

A. Does your data contain any missing values and/or blank cells? If so, what can you do about this? Show the R code that you used to handle your missing values. Write one paragraph describing what you did, and why.

# Handling NA values: convert blank cells to NA first, then count missing values
# per column (counting before the conversion would miss the blanks)
nyc_data[nyc_data == ""] <- NA
missing_values <- colSums(is.na(nyc_data))
head(nyc_data)
##         id log_price property_type       room_type
## 1  6901257  5.010635     Apartment Entire home/apt
## 2  6304928  5.129899     Apartment Entire home/apt
## 3  7919400  4.976734     Apartment Entire home/apt
## 4  5578513  4.605170     Apartment    Private room
## 5 17589436  4.882802     Apartment Entire home/apt
## 6 18224863  4.595120         House Entire home/apt
##                                                                                                                                                                                                                                                                                                                                                amenities
## 1                                                                                                                                                                                               {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Essentials,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
## 2                                                                                                                             {"Wireless Internet","Air conditioning",Kitchen,Heating,"Family/kid friendly",Washer,Dryer,"Smoke detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"translation missing: en.hosting_amenity_50"}
## 3                                {TV,"Cable TV","Wireless Internet","Air conditioning",Kitchen,Breakfast,"Buzzer/wireless intercom",Heating,"Family/kid friendly","Smoke detector","Carbon monoxide detector","Fire extinguisher",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace","translation missing: en.hosting_amenity_50"}
## 4                                                                                                     {Internet,"Wireless Internet","Air conditioning",Kitchen,"Pets allowed","Buzzer/wireless intercom",Heating,Washer,Dryer,"Smoke detector",Essentials,Hangers,"Hair dryer","Laptop friendly workspace","translation missing: en.hosting_amenity_50"}
## 5                                                                                       {TV,"Cable TV",Internet,"Wireless Internet","Air conditioning","Wheelchair accessible",Kitchen,Elevator,"Buzzer/wireless intercom",Heating,"Smoke detector","Carbon monoxide detector",Essentials,Shampoo,Hangers,"Hair dryer",Iron,"Laptop friendly workspace"}
## 6 {TV,"Wireless Internet","Air conditioning",Kitchen,Breakfast,Heating,"Smoke detector","Fire extinguisher",Essentials,"Lock on bedroom door",Hangers,"Hair dryer",Iron,"Laptop friendly workspace","Private entrance","Hot water","Bed linens","Extra pillows and blankets",Microwave,Refrigerator,"Dishes and silverware","Cooking basics",Oven,Stove}
##   accommodates bathrooms bed_type cancellation_policy cleaning_fee city
## 1            3         1 Real Bed              strict         True  NYC
## 2            7         1 Real Bed              strict         True  NYC
## 3            5         1 Real Bed            moderate         True  NYC
## 4            2         1 Real Bed              strict         True  NYC
## 5            2         1 Real Bed              strict         True  NYC
## 6            8         1 Real Bed              strict         True  NYC
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                description
## 1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and Brooklyn Bridge Park, with their stunning views of Manhattan, and from the great shopping and food.
## 2 Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Park with easy access to main subways as well as walking distance to many popular restaurants and bus tours. My place is close to the subway, Totto Ramen, Hell's Kitchen, Ippudo Westside, Empanada Mama, Intrepid Sea, Air & Space Museum. My place has three true bedrooms and one bathroom. The kitchen is stocked with stainless steel appliances like the Keurig machine. The living room is spacious and can accommodate another person thanks to the pull out bed. My place is centrally located to some of the top attractions in the city. Feel free to explore the entire apartment and do not worry about sharing the space with any strangers. This is all yours during your stay. I am available via text/email/phone for anything you might need. - Times Square - Rockefeller Plaza - Central Park  - 5th Avenue Shopping  -Broadway Theater District  - Empire State Building  - Hudson River Express Subwa
## 3 The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!!  The unit has high ceilings, a completed renovation throughout, beautiful flood lighting and exposed brick! Best part, total seclusion. You share with no one! The entire unit is yours during your stay.  It's a fully furnished apartment that can hold up to 5 people. The only items you need are a toothbrush and your luggage!!! The unit has high ceilings, a completed renovation throughout, beautiful flood lighting and exposed brick! Not to mention the large backyard complete with ourdoor furniture.  Best part, total seclusion. You share with no one! The entire unit is yours during your stay. The entire unit and backyard Garden area My assistant is available off site via phone.   Other than that, you will be alone for your stay. The neighborhood of central Harlem is very diverse and full of culture! We are a few blocks from Historic 125th Street, The Apollo Theatre, The Schomburg
## 4 This is a bright bedroom in an historic building with exposed brick walls and modern appointments, situated in the most vibrant and eclectic neighborhood in Manhattan!  You'll get an authentic experience during your stay in New York here. I have a room available in a lovely apartment in the East Village, close to the subway (5-10 minutes) and SoHo, Little Italy, and the Lower East Side (Each about 15 minutes).  Also located nearby are the eclectic and Lively St. Marks Place and Tompkins Square Park.   The room has a full-size bed, a large wardrobe, bookshelves, and a closet, and it has two large windows with plenty of light.   The apartment has a fully-equipped kitchen, a living room, a bathroom, a washer, and a dryer, all of which guests are welcome to use. The East Village is a great neighborhood with ample nightlife, and borders many other well-known Manhattan neighborhoods each with unique restaurants, shops, and character.  It is minutes away from the 2nd Avenue station for the F 
## 5 I am renting a very spacious, unique, bright L shape studio filled with character for perfect NYC vibe & getaway. Apt is located Lower East Side (LES) walking distant SOHO,Nolita, Chinatown, lots of restaurants and bars and close to the F,B,D,J,M,Z . Hello all , I am happy to rent my apartment to  respectful, clean individuals. I love my apartment and want to make sure it is in good hands while away.  My apt. is located one of the most convenient and attractive neighborhood  in Manhattan, Lower East Side (LES). It's a very quiet  apartment best fits 2 guests. ( one Queen size real bed 60" x 80" and one single day bed) 15 feet wide window provides great light and eastern views of downtown Manhattan, lying in bed and gazing out the window is lovely. Furniture's are carefully selected by award winner interior designer  and items within the space give the whole apartment a colorful , light and airy / spacious feeling. Any questions I am happy to answer and after 14+ years living in New Yor
## 6                                                                                                                                                                                                      Huge 3 bedroom apartment situated in Flatbush, Brooklyn. Average 8 mins walk to train station. Average 30 mins to Manhattan and 10 mins ride to Barclays Center, Prospect Park, Brooklyn Museum, Kings Theatre and Williamsburg.This apartment minimun booking amount are three guests. Spacious 3 bedroom with lots of closet space.  Private entrance Full kitchen Large living  room Large bedrooms Wifi Queen size bed Private entrance Yes Multicultural neighborhood, shopping areas as well as restaurants that are in close proximity. Shopping and fitness centers within walking distance. Average 8 mins walk to train station and 2 mins to the bus.  Average 30 mins to Manhattan and 10 mins ride to Barclays Center, Prospect Park and Brooklyn Museum. 10 mins walk to Kings Theatre. Bus or train to Williamsburg.
##   first_review host_has_profile_pic host_identity_verified host_response_rate
## 1   2016-06-18                    t                      t               <NA>
## 2   2017-08-05                    t                      f               100%
## 3   2017-04-30                    t                      t               100%
## 4   2013-04-28                    t                      t               100%
## 5   2015-09-26                    t                      t                71%
## 6   2017-07-30                    t                      f               100%
##   host_since instant_bookable last_review latitude longitude
## 1 2012-03-26                f  2016-07-18 40.69652 -73.99162
## 2 2017-06-19                t  2017-09-23 40.76612 -73.98904
## 3 2016-10-25                t  2017-09-14 40.80811 -73.94376
## 4 2013-03-27                f  2016-11-15 40.72388 -73.98388
## 5 2015-08-17                f  2016-12-12 40.71909 -73.99028
## 6 2017-05-09                t  2017-09-15 40.64377 -73.95085
##                                            name    neighbourhood
## 1                Beautiful brownstone 1-bedroom Brooklyn Heights
## 2      Superb 3BR Apt Located Near Times Square   Hell's Kitchen
## 3                              The Garden Oasis           Harlem
## 4            Large East Village Bedroom To Let!    Alphabet City
## 5             Amazing LES apt - cool, bright...  Lower East Side
## 6 3 Bedroom Apartment for Small Group or Family         Flatbush
##   number_of_reviews review_scores_rating
## 1                 2                  100
## 2                 6                   93
## 3                10                   92
## 4                82                   93
## 5                26                   86
## 6                 5                   72
##                                                                                   thumbnail_url
## 1 https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small
## 2 https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small
## 3 https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small
## 4           https://a0.muscache.com/im/pictures/21726900/19c843e1_original.jpg?aki_policy=small
## 5                                                                                          <NA>
## 6 https://a0.muscache.com/im/pictures/e9baba99-e67a-413c-9986-9386f8c5fa2f.jpg?aki_policy=small
##   zipcode bedrooms beds
## 1   11201        1    1
## 2   10019        3    3
## 3   10027        1    3
## 4 10009.0        1    1
## 5   10002        1    2
## 6 11226.0        3    3

We chose “NYC” for our case study and, as a first step, filtered the dataset on the “city” column. We then began data preparation by checking for missing values. Since blank cells are not automatically recognized as missing (NA) in R, we converted all empty strings to NA with `nyc_data[nyc_data == ""] <- NA`. This step ensures that R can correctly detect and handle missing data during analysis. We then used `colSums(is.na(nyc_data))` to count the number of missing values in each column. Addressing missing data is essential to avoid bias or errors in our analysis and to maintain the reliability of any statistical models built from the dataset.
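The blank-to-NA conversion and the missing-value count can also be written as a single dplyr pipeline; a minimal sketch using the same nyc_data frame (na_if() is applied only to the character columns, where the blanks occur):

```r
library(dplyr)

# Replace empty strings with NA in every character column
nyc_data <- nyc_data %>%
  mutate(across(where(is.character), ~ na_if(.x, "")))

# Missing-value counts per column, sorted so the worst offenders surface first
sort(colSums(is.na(nyc_data)), decreasing = TRUE)
```

This is equivalent to the base-R assignment above but leaves non-character columns untouched.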

II. Summary Statistics

A. Choose any five of the summary statistics functions shown in the textbook, class slides, or anywhere else to learn a little bit about your data set.

# II. Summary Statistics
# Compute summary statistics
summary_stats <- nyc_data %>% summarise(
  mean_value = mean(log_price, na.rm = TRUE),
  median_value = median(log_price, na.rm = TRUE),
  sd_value = sd(log_price, na.rm = TRUE),
  min_value = min(log_price, na.rm = TRUE),
  max_value = max(log_price, na.rm = TRUE)
)
print(summary_stats)
##   mean_value median_value  sd_value min_value max_value
## 1    4.71934      4.65396 0.6615672         0  7.600402

B. Show screenshots of the results. Describe your findings in 1-2 paragraphs.

The summary statistics for log_price in the NYC dataset reveal key characteristics of the price distribution. The mean log price is 4.719 and the median is 4.653; because the mean exceeds the median, the distribution is slightly right-skewed. The standard deviation (0.66) suggests moderate variability in property prices.

The minimum log price is 0, which is unusual and may indicate missing or incorrect data entries that need further investigation. On the other hand, the maximum log price is 7.60, showing the upper bound of property prices in the dataset.
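The minimum of 0 can be investigated directly. A quick sketch (using the same nyc_data frame) counts the affected rows; note that a log price of 0 corresponds to exp(0) = $1 per night, which is almost certainly a data-entry artifact:

```r
# Count listings whose log_price is exactly 0 (a $1 nightly rate)
sum(nyc_data$log_price == 0, na.rm = TRUE)

# Inspect them before deciding whether to drop or correct these rows
subset(nyc_data, log_price == 0, select = c(id, room_type, neighbourhood))
```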

III. Visualization

A. Using ggplot, create any five plots that help to describe your data.

# III. Visualization
str(nyc_data)
## 'data.frame':    32349 obs. of  29 variables:
##  $ id                    : int  6901257 6304928 7919400 5578513 17589436 18224863 16679342 14122244 14490881 4734170 ...
##  $ log_price             : num  5.01 5.13 4.98 4.61 4.88 ...
##  $ property_type         : chr  "Apartment" "Apartment" "Apartment" "Apartment" ...
##  $ room_type             : chr  "Entire home/apt" "Entire home/apt" "Entire home/apt" "Private room" ...
##  $ amenities             : chr  "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Essentials,\"Hair dryer\",I"| __truncated__ "{\"Wireless Internet\",\"Air conditioning\",Kitchen,Heating,\"Family/kid friendly\",Washer,Dryer,\"Smoke detect"| __truncated__ "{TV,\"Cable TV\",\"Wireless Internet\",\"Air conditioning\",Kitchen,Breakfast,\"Buzzer/wireless intercom\",Heat"| __truncated__ "{Internet,\"Wireless Internet\",\"Air conditioning\",Kitchen,\"Pets allowed\",\"Buzzer/wireless intercom\",Heat"| __truncated__ ...
##  $ accommodates          : int  3 7 5 2 2 8 2 6 4 3 ...
##  $ bathrooms             : num  1 1 1 1 1 1 1.5 1 1 1 ...
##  $ bed_type              : chr  "Real Bed" "Real Bed" "Real Bed" "Real Bed" ...
##  $ cancellation_policy   : chr  "strict" "strict" "moderate" "strict" ...
##  $ cleaning_fee          : chr  "True" "True" "True" "True" ...
##  $ city                  : chr  "NYC" "NYC" "NYC" "NYC" ...
##  $ description           : chr  "Beautiful, sunlit brownstone 1-bedroom in the loveliest neighborhood in Brooklyn. Blocks from the promenade and"| __truncated__ "Enjoy travelling during your stay in Manhattan. My place is centrally located near Times Square and Central Par"| __truncated__ "The Oasis comes complete with a full backyard with outdoor furniture to make the most of this summer vacation!!"| __truncated__ "This is a bright bedroom in an historic building with exposed brick walls and modern appointments, situated in "| __truncated__ ...
##  $ first_review          : chr  "2016-06-18" "2017-08-05" "2017-04-30" "2013-04-28" ...
##  $ host_has_profile_pic  : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified: chr  "t" "f" "t" "t" ...
##  $ host_response_rate    : chr  NA "100%" "100%" "100%" ...
##  $ host_since            : chr  "2012-03-26" "2017-06-19" "2016-10-25" "2013-03-27" ...
##  $ instant_bookable      : chr  "f" "t" "t" "f" ...
##  $ last_review           : chr  "2016-07-18" "2017-09-23" "2017-09-14" "2016-11-15" ...
##  $ latitude              : num  40.7 40.8 40.8 40.7 40.7 ...
##  $ longitude             : num  -74 -74 -73.9 -74 -74 ...
##  $ name                  : chr  "Beautiful brownstone 1-bedroom" "Superb 3BR Apt Located Near Times Square" "The Garden Oasis" "Large East Village Bedroom To Let!" ...
##  $ neighbourhood         : chr  "Brooklyn Heights" "Hell's Kitchen" "Harlem" "Alphabet City" ...
##  $ number_of_reviews     : int  2 6 10 82 26 5 57 14 4 40 ...
##  $ review_scores_rating  : num  100 93 92 93 86 72 89 100 100 88 ...
##  $ thumbnail_url         : chr  "https://a0.muscache.com/im/pictures/6d7cbbf7-c034-459c-bc82-6522c957627c.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/348a55fe-4b65-452a-b48a-bfecb3b58a66.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/6fae5362-9e3a-4fa9-aa54-bbd5ea26538d.jpg?aki_policy=small" "https://a0.muscache.com/im/pictures/21726900/19c843e1_original.jpg?aki_policy=small" ...
##  $ zipcode               : chr  "11201" "10019" "10027" "10009.0" ...
##  $ bedrooms              : num  1 3 1 1 1 3 1 3 2 1 ...
##  $ beds                  : num  1 3 3 1 2 3 1 3 2 2 ...
# 1. Histogram of log_price
ggplot(nyc_data, aes(x = log_price)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  labs(title = "Distribution of Log Price", x = "Log Price", y = "Count")

# 2. Boxplot of log_price
ggplot(nyc_data, aes(y = log_price)) +
  geom_boxplot(fill = "red", alpha = 0.5) +
  labs(title = "Boxplot of Log Price", y = "Log Price")

# 3. Scatter plot of log_price vs number of bedrooms
# Remove rows with missing or non-finite values before plotting
nyc_clean <- nyc_data %>% 
  filter(!is.na(bedrooms), !is.na(log_price), is.finite(log_price))

ggplot(nyc_clean, aes(x = bedrooms, y = log_price)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Log Price vs Bedrooms", x = "Bedrooms", y = "Log Price")
## `geom_smooth()` using formula = 'y ~ x'

# 4. Scatter plot of log_price vs review_scores_rating
# Clean the data for this specific plot since some values are missing
nyc_clean2 <- nyc_data %>%
  filter(!is.na(log_price), !is.na(review_scores_rating), 
         is.finite(log_price), is.finite(review_scores_rating))

ggplot(nyc_clean2, aes(x = log_price, y = review_scores_rating)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(title = "Log Price vs Review Scores Rating", 
       x = "Log Price", 
       y = "Review Scores Rating")

# 5. Density plot of log_price
ggplot(nyc_data, aes(x = log_price)) +
  geom_density(fill = "green", alpha = 0.5) +
  labs(title = "Density Plot of Log Price", x = "Log Price", y = "Density")

B. Write a two-paragraph description that explains the choices that you made, and what the resulting plots show.

The five visualizations illustrate different perspectives on the distribution of property prices (log_price) and its relationships with other variables. (1) The histogram presents the frequency distribution of log_price, showing a roughly normal shape with a peak around 4–5. (2) The boxplot highlights outliers in both the lower and upper ranges, which could indicate extreme property values that warrant further investigation.

(3) The scatter plot of log price against the number of bedrooms suggests a positive correlation between the two variables, as expected in real estate pricing. The linear trend line reinforces this relationship, indicating that as the number of bedrooms increases, property prices generally rise. However, the plot also reveals considerable spread and some outliers, meaning additional factors influence price beyond bedroom count. (4) The plot of log price against review score shows a massive concentration of points at review_scores_rating = 100, especially in the log_price range of 4.0 to 5.5, with no clear trend. This suggests no strong linear relationship between price and rating: customers tend to rate highly whether the price is low, medium, or high, so the review score is not a strong discriminator of price. (5) Finally, the density plot gives a smoothed view of the log_price distribution, confirming the single peak near the mean of 4.7 seen in the histogram.
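Since log_price is the natural log of the nightly rate, values on these axes can be translated back into dollars for easier interpretation; a quick base-R check:

```r
# Convert representative log-price values back to dollars per night.
# The peak of the distribution near log_price = 4.7 is roughly $110/night.
round(exp(c(4.0, 4.7, 5.5)))
```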

Step II. Prediction

I. Create a multiple regression model with the outcome variable log_price.

# Prepare the regression data: keep the relevant variables and drop every row
# that contains an NA in any of these columns
regression_data <- dplyr::select(nyc_data, 
 log_price, bedrooms, bathrooms, number_of_reviews, review_scores_rating, accommodates, 
 room_type, cancellation_policy, host_response_rate, instant_bookable, latitude, longitude) %>%
     drop_na()

# Safeguard: impute any remaining missing values for continuous variables with
# the median. After drop_na() above this is a no-op, but it protects the
# pipeline if the drop step is ever removed; the same applies to the
# categorical imputation further below.
regression_data$bedrooms <- ifelse(is.na(regression_data$bedrooms), 
                                   median(regression_data$bedrooms, na.rm = TRUE), 
                                   regression_data$bedrooms)
regression_data$bathrooms <- ifelse(is.na(regression_data$bathrooms), 
                                    median(regression_data$bathrooms, na.rm = TRUE), 
                                    regression_data$bathrooms)
regression_data$review_scores_rating <- ifelse(is.na(regression_data$review_scores_rating), 
                                               median(regression_data$review_scores_rating, na.rm = TRUE), 
                                               regression_data$review_scores_rating)
# Note: host_response_rate is stored as character ("100%"), so median() here
# picks the middle value of an alphabetical sort; convert it to numeric first
# if this column is ever used in the model (it is not included below)
regression_data$host_response_rate <- ifelse(is.na(regression_data$host_response_rate), 
                                             median(regression_data$host_response_rate, na.rm = TRUE), 
                                             regression_data$host_response_rate)

# Impute missing values for categorical variables with mode
mode_function <- function(x) {
  uniq_x <- unique(x)
  uniq_x[which.max(tabulate(match(x, uniq_x)))]
}

regression_data$room_type <- ifelse(is.na(regression_data$room_type), 
                                    mode_function(regression_data$room_type), 
                                    regression_data$room_type)
regression_data$cancellation_policy <- ifelse(is.na(regression_data$cancellation_policy), 
                                              mode_function(regression_data$cancellation_policy), 
                                              regression_data$cancellation_policy)
regression_data$instant_bookable <- ifelse(is.na(regression_data$instant_bookable), 
                                           mode_function(regression_data$instant_bookable), 
                                           regression_data$instant_bookable)

# Convert categorical variables to factors
regression_data$room_type <- as.factor(regression_data$room_type)
regression_data$cancellation_policy <- as.factor(regression_data$cancellation_policy)
regression_data$instant_bookable <- as.factor(regression_data$instant_bookable)
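When these factors enter the regression, R treats the first level of each as the baseline that the dummy coefficients are measured against. A short sketch of inspecting and, if desired, changing that reference level (the ref value shown is simply the default first level, for illustration):

```r
# The first level is the baseline the dummy coefficients are measured against
levels(regression_data$room_type)   # "Entire home/apt" comes first alphabetically

# relevel() changes the reference category without altering the data
regression_data$room_type <- relevel(regression_data$room_type, ref = "Entire home/apt")
```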

A. Describe your process. How did you wind up including the independent variables that you kept, and discarding the ones that you didn’t keep? In a narrative of at least two paragraphs, discuss your process and your reasoning. In the write-up, be sure to talk about how you evaluated the quality of your model.

To build a model that predicts the log price of Airbnb listings, we first looked at all the columns in the dataset. We picked the ones we thought could affect the price, such as the number of bedrooms and bathrooms, how many guests the listing accommodates, how many reviews it has, and its review score. We also added the location (latitude and longitude), the room type (entire home, private room, or shared room), and the cancellation policy. We removed variables that would not help, like the listing’s name, description, and ID number, since those do not really affect price. We converted room type and cancellation policy into factors so the model could treat them as categories.

After that, we ran the regression model and looked at each variable’s p-value. If a p-value was higher than 0.05, the variable probably does not affect the price much, so it was a candidate for removal. For example, the “moderate” level of cancellation_policy had a high p-value; since it is only one level of a factor we kept, it remains in the equation but is not individually significant. We also made sure that our variables were not too similar to each other by checking variance inflation factors (VIF).

To check how good our model is, we looked at two things:

  • R-squared tells us how much of the variation in price the model can explain.

  • RMSE tells us how far off the model’s predictions are, on average, from the actual log prices.
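The VIF check mentioned above does not appear in the code. A minimal sketch, assuming the car package is installed, applied to the lm object fitted in the next section (named `model` there):

```r
library(car)  # assumed installed; provides vif()

# Generalized VIFs for a model containing factor terms; values well below 5
# suggest the predictors are not strongly collinear
vif(model)
```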

B. Show a screenshot of your regression summary, and explain the regression equation that it generated.

1. What is the r-squared of your model? What does this mean?

2. What is the RMSE of your model? What does this mean?

# Fit the linear regression model
model <- lm(log_price ~ bedrooms + bathrooms + number_of_reviews + 
                      review_scores_rating + accommodates + 
                      room_type + cancellation_policy + 
                      latitude + longitude,
                    data = regression_data)

# View the regression summary
options(scipen = 999)
summary(model)
## 
## Call:
## lm(formula = log_price ~ bedrooms + bathrooms + number_of_reviews + 
##     review_scores_rating + accommodates + room_type + cancellation_policy + 
##     latitude + longitude, data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6191 -0.2382 -0.0083  0.2224  2.4731 
## 
## Coefficients:
##                                         Estimate    Std. Error t value
## (Intercept)                        -345.45396394    5.32716608 -64.848
## bedrooms                              0.09921290    0.00524634  18.911
## bathrooms                             0.11289394    0.00753276  14.987
## number_of_reviews                    -0.00031425    0.00007381  -4.257
## review_scores_rating                  0.00473564    0.00037595  12.596
## accommodates                          0.07784139    0.00224690  34.644
## room_typePrivate room                -0.60274721    0.00683231 -88.220
## room_typeShared room                 -0.99729137    0.01832176 -54.432
## cancellation_policymoderate          -0.00623078    0.00848373  -0.734
## cancellation_policystrict             0.02617924    0.00771745   3.392
## cancellation_policysuper_strict_30    0.52560909    0.22382877   2.348
## cancellation_policysuper_strict_60    1.65563878    0.38742121   4.273
## latitude                              1.50581934    0.05155542  29.208
## longitude                            -3.89728759    0.06393523 -60.957
##                                                Pr(>|t|)    
## (Intercept)                        < 0.0000000000000002 ***
## bedrooms                           < 0.0000000000000002 ***
## bathrooms                          < 0.0000000000000002 ***
## number_of_reviews                             0.0000208 ***
## review_scores_rating               < 0.0000000000000002 ***
## accommodates                       < 0.0000000000000002 ***
## room_typePrivate room              < 0.0000000000000002 ***
## room_typeShared room               < 0.0000000000000002 ***
## cancellation_policymoderate                    0.462690    
## cancellation_policystrict                      0.000695 ***
## cancellation_policysuper_strict_30             0.018871 *  
## cancellation_policysuper_strict_60            0.0000193 ***
## latitude                           < 0.0000000000000002 ***
## longitude                          < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3873 on 19401 degrees of freedom
## Multiple R-squared:  0.6454, Adjusted R-squared:  0.6452 
## F-statistic:  2717 on 13 and 19401 DF,  p-value: < 0.00000000000000022
# Calculate RMSE
predictions <- predict(model, newdata = regression_data)
rmse <- sqrt(mean((predictions - regression_data$log_price)^2))
cat("RMSE: ", rmse, "\n")
## RMSE:  0.3871815
# Evaluate R-squared
rsq <- summary(model)$r.squared
cat("R-squared: ", rsq, "\n")
## R-squared:  0.645433

Regression Formula:

log_price = -345.45
          + 0.0992 * bedrooms
          + 0.1129 * bathrooms
          - 0.0003 * number_of_reviews
          + 0.0047 * review_scores_rating
          + 0.0778 * accommodates
          - 0.6027 * room_typePrivate room
          - 0.9973 * room_typeShared room
          - 0.0062 * cancellation_policymoderate
          + 0.0262 * cancellation_policystrict
          + 0.5256 * cancellation_policysuper_strict_30
          + 1.6556 * cancellation_policysuper_strict_60
          + 1.5058 * latitude
          - 3.8973 * longitude

Each number (called a “coefficient”) shows how log_price changes when one predictor increases by one unit, holding the others fixed. For example:

  • More bedrooms = higher price

  • Private rooms and shared rooms = lower price than entire homes

  • Higher rating = higher price

  • Stricter cancellation policies = higher prices.

The R-squared is 0.6454, which means our model explains about 64.5% of the variation in log_price. That’s reasonably good, though it also means roughly a third of the price variation comes from factors not included in the model.

The RMSE is 0.3872, which means our model’s predictions are off by about 0.39 (on the log scale) on average; the smaller the RMSE, the better. One caveat: this RMSE was computed on the same data the model was fit to, so it is a slightly optimistic estimate of how the model would do on new listings.
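Because the outcome is on the log scale, the coefficients and the RMSE translate into multiplicative effects on price. A minimal illustrative sketch (the coefficient values are copied from the regression summary above):

```r
# A coefficient b on log(price) implies roughly a (exp(b) - 1) * 100% change
# in price per one-unit increase in the predictor.
coefs <- c(bedrooms = 0.0992, bathrooms = 0.1129, private_room = -0.6027)
round((exp(coefs) - 1) * 100, 1)
# One extra bedroom ~ +10.4%, one extra bathroom ~ +12.0%,
# and a private room ~ -45.3% relative to an entire home/apt.

# Likewise, an RMSE of 0.3872 on the log scale means a typical prediction
# misses the actual price by a factor of about exp(0.3872) ~ 1.47.
exp(0.3872)
```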

Step III: Classification (40 points)

Part I. Using k-nearest neighbors, predict whether a rental in your city would have a cleaning fee. Use any set of predictors in order to build this model.

A. Show the code you used to run your model, and the code you used to assess your model.

# Load necessary libraries for KNN classification
library(caret)
library(class)
## 
## Attaching package: 'class'
## The following objects are masked from 'package:FNN':
## 
##     knn, knn.cv
# Prepare the data for classification
nyc_data_clean <- dplyr::select(nyc_data, cleaning_fee, log_price, bedrooms, bathrooms, number_of_reviews, 
                                review_scores_rating, accommodates, room_type, cancellation_policy, 
                                host_response_rate, instant_bookable, latitude, longitude)

# Convert cleaning_fee: TRUE/FALSE to factor ("Yes"/"No")
nyc_data_clean$cleaning_fee <- ifelse(nyc_data_clean$cleaning_fee, "Yes", "No")
nyc_data_clean$cleaning_fee <- as.factor(nyc_data_clean$cleaning_fee)

# Select predictors for KNN classification
predictors <- c("bedrooms", "bathrooms", "number_of_reviews", "review_scores_rating", 
                "accommodates", "room_type", "latitude", "longitude")

# Prepare feature matrix and target vector
X <- nyc_data_clean[, predictors]
y <- nyc_data_clean$cleaning_fee

# Split data into training and testing sets
set.seed(123)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)
train_X <- X[train_index, ]
train_Y <- y[train_index]
test_X <- X[-train_index, ]
test_Y <- y[-train_index]

# Impute missing numeric values using median
numeric_cols <- c("bedrooms", "bathrooms", "review_scores_rating")
for (col in numeric_cols) {
  train_X[[col]][is.na(train_X[[col]])] <- median(train_X[[col]], na.rm = TRUE)
  test_X[[col]][is.na(test_X[[col]])] <- median(test_X[[col]], na.rm = TRUE)
}

# One-hot encode 'room_type'
train_X <- cbind(train_X[, -which(names(train_X) == "room_type")],
                 model.matrix(~ room_type - 1, data = train_X))
test_X <- cbind(test_X[, -which(names(test_X) == "room_type")],
                model.matrix(~ room_type - 1, data = test_X))

# Scale numeric features
train_X_scaled <- scale(train_X)
test_X_scaled <- scale(test_X, center = attr(train_X_scaled, "scaled:center"),
                       scale = attr(train_X_scaled, "scaled:scale"))

# Try different k values and find the one with best accuracy
accuracy <- sapply(1:20, function(k) {
  pred <- knn(train = train_X_scaled, test = test_X_scaled, cl = train_Y, k = k)
  mean(pred == test_Y)
})

# Best k and accuracy
best_k <- which.max(accuracy)
cat("Best k:", best_k, "\n")
## Best k: 19
cat("Best Accuracy:", accuracy[best_k], "\n")
## Best Accuracy: 0.7073736
# Final KNN prediction with best_k
final_pred <- knn(train = train_X_scaled, test = test_X_scaled, cl = train_Y, k = best_k)

# Evaluate the model
conf_matrix <- confusionMatrix(final_pred, test_Y)
print(conf_matrix)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   No  Yes
##        No   479  480
##        Yes 1413 4097
##                                              
##                Accuracy : 0.7074             
##                  95% CI : (0.6961, 0.7184)   
##     No Information Rate : 0.7075             
##     P-Value [Acc > NIR] : 0.5171             
##                                              
##                   Kappa : 0.1734             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.25317            
##             Specificity : 0.89513            
##          Pos Pred Value : 0.49948            
##          Neg Pred Value : 0.74356            
##              Prevalence : 0.29247            
##          Detection Rate : 0.07405            
##    Detection Prevalence : 0.14825            
##       Balanced Accuracy : 0.57415            
##                                              
##        'Positive' Class : No                 
## 

B. Write a two-paragraph narrative that describes how you did this. In your narrative, be sure to describe your predictor choices, and mention how you arrived at the particular k value that you used.

In this part, we wanted to predict whether a listing has a cleaning fee (yes or no). To do this, we used a machine learning method called k-nearest neighbors (KNN). We first converted the cleaning_fee variable into a factor with levels “Yes” and “No”. Then we selected predictors we expected to relate to cleaning fees: bedrooms, bathrooms, number of reviews, review scores, how many guests the listing accommodates, room type, latitude, and longitude.

Because KNN computes distances between numeric features, we one-hot encoded room type, imputed missing numeric values with medians, and scaled the data so all features were on the same scale. We then split the data into a training set (80%, for learning) and a testing set (20%, to check how well the model performs).

We tested k values from 1 to 20 and found that k = 19 gave the best accuracy. This means the model looked at the 19 listings most similar to the one it was predicting and took a majority vote among them. Finally, we checked the results using a confusion matrix. Although overall accuracy was about 70.7%, that is essentially the same as the no-information rate (70.8%), and sensitivity for the “No” class was only about 25%, so the model adds little over simply predicting “Yes” for every listing.
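One caveat about the tuning loop above: choosing k by test-set accuracy lets the test data influence model selection. A hedged alternative sketch, using caret’s cross-validation (and assuming the train_X_scaled, train_Y, test_X_scaled, and test_Y objects created earlier), tunes k on the training data only:

```r
# Sketch: choose k by 5-fold cross-validation on the training data,
# then evaluate once on the untouched test set.
set.seed(123)
knn_cv <- train(x = as.data.frame(train_X_scaled), y = train_Y,
                method = "knn",
                tuneGrid = data.frame(k = seq(1, 25, by = 2)),
                trControl = trainControl(method = "cv", number = 5))
knn_cv$bestTune                      # k selected by cross-validation
cv_pred <- predict(knn_cv, newdata = as.data.frame(test_X_scaled))
mean(cv_pred == test_Y)              # honest test-set accuracy
```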

Part II. Naive Bayes

C. Using the log_price variable, create four similarly-sized bins, or categories, for the rental prices in your city (Pricey Digs, Above Average, Below Average, Student Budget).

The host response rate variable was converted to numeric and then binned. We wanted to see whether reducing the number of levels of host response rate by binning would make it a good predictor for the model. Cancellation policy was consolidated into three levels for the Naive Bayes and decision tree models.

#removed NA from df 
nyc_data_clean <- nyc_data %>%
  drop_na()

#Naive Bayes 
nyc_naive.df <- nyc_data_clean
#str(nyc_naive.df)


#run this line of code to clear global environ except for orig df 
#rm(list = setdiff(ls(), "train_csv"))

#convert chr var to factors 
#ignoring city b/c data has been filtered for NYC 
nyc_naive.df$property_type<-as.factor(nyc_naive.df$property_type)
nyc_naive.df$room_type<-as.factor(nyc_naive.df$room_type)
nyc_naive.df$amenities<-as.factor(nyc_naive.df$amenities)
nyc_naive.df$bed_type<-as.factor(nyc_naive.df$bed_type)
nyc_naive.df$zipcode<-as.factor(nyc_naive.df$zipcode)
nyc_naive.df$thumbnail_url<-as.factor(nyc_naive.df$thumbnail_url)
nyc_naive.df$name<-as.factor(nyc_naive.df$name)
nyc_naive.df$neighbourhood<-as.factor(nyc_naive.df$neighbourhood)
#nyc_naive.df$host_response_rate<-as.factor(nyc_naive.df$host_response_rate)
nyc_naive.df$description<-as.factor(nyc_naive.df$description)

#Converted host_response_rate to num then to factor 
#summary(nyc_naive.df$host_response_rate)

nyc_naive.df$host_response_rate <- as.numeric(gsub("%", "", nyc_naive.df$host_response_rate))

nyc_naive.df$host_response_rate <- cut(
  nyc_naive.df$host_response_rate,
  breaks = c(0, 50, 75, 95, 100),
  labels = c("never", "sometimes", "most times", "always"),
  include.lowest = TRUE,
  right = TRUE
)


summary(nyc_naive.df$host_response_rate)
##      never  sometimes most times     always 
##        442        711       2280      14423
#created three outcome classes for cancellation policy 

nyc_naive.df$cancellation_policy[nyc_naive.df$cancellation_policy %in% c("super_strict_30", "super_strict_60")] <- "strict"

nyc_naive.df$cancellation_policy<-as.factor(nyc_naive.df$cancellation_policy)
summary(nyc_naive.df$cancellation_policy)
## flexible moderate   strict 
##     3333     4784     9739
str(nyc_naive.df)
## 'data.frame':    17856 obs. of  29 variables:
##  $ id                    : int  6304928 7919400 5578513 18224863 16679342 14122244 14490881 4734170 421056 15705862 ...
##  $ log_price             : num  5.13 4.98 4.61 4.6 3.69 ...
##  $ property_type         : Factor w/ 24 levels "Apartment","Bed & Breakfast",..: 1 1 1 15 15 1 1 1 1 1 ...
##  $ room_type             : Factor w/ 3 levels "Entire home/apt",..: 1 1 2 1 2 1 1 2 2 2 ...
##  $ amenities             : Factor w/ 16951 levels "{\"Air conditioning\",\"Pets allowed\",Heating,\"Smoke detector\",\"Carbon monoxide detector\",Essentials,Shamp"| __truncated__,..: 2081 6325 4216 11930 3219 8290 6359 15925 7182 6076 ...
##  $ accommodates          : int  7 5 2 8 2 6 4 3 2 2 ...
##  $ bathrooms             : num  1 1 1 1 1.5 1 1 1 1 1 ...
##  $ bed_type              : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ cancellation_policy   : Factor w/ 3 levels "flexible","moderate",..: 3 2 3 3 3 3 3 3 3 2 ...
##  $ cleaning_fee          : chr  "True" "True" "True" "True" ...
##  $ city                  : chr  "NYC" "NYC" "NYC" "NYC" ...
##  $ description           : Factor w/ 17724 levels "         Great sunny second floor apartment in Brooklyn, Bedstuy. The building is a new construction with eleva"| __truncated__,..: 4835 13822 14838 6306 4283 9109 14380 10118 13613 5805 ...
##  $ first_review          : chr  "2017-08-05" "2017-04-30" "2013-04-28" "2017-07-30" ...
##  $ host_has_profile_pic  : chr  "t" "t" "t" "t" ...
##  $ host_identity_verified: chr  "f" "t" "t" "f" ...
##  $ host_response_rate    : Factor w/ 4 levels "never","sometimes",..: 4 4 4 4 4 4 4 2 2 3 ...
##  $ host_since            : chr  "2017-06-19" "2016-10-25" "2013-03-27" "2017-05-09" ...
##  $ instant_bookable      : chr  "t" "t" "f" "t" ...
##  $ last_review           : chr  "2017-09-23" "2017-09-14" "2016-11-15" "2017-09-15" ...
##  $ latitude              : num  40.8 40.8 40.7 40.6 40.7 ...
##  $ longitude             : num  -74 -73.9 -74 -74 -73.9 ...
##  $ name                  : Factor w/ 17754 levels " 1 Bed Apt in Utopic Williamsburg ",..: 16396 16545 9396 891 16585 9093 10269 3935 11114 9042 ...
##  $ neighbourhood         : Factor w/ 193 levels "Allerton","Alphabet City",..: 85 84 2 63 53 93 68 127 99 22 ...
##  $ number_of_reviews     : int  6 10 82 5 57 14 4 40 1 17 ...
##  $ review_scores_rating  : num  93 92 93 72 89 100 100 88 100 96 ...
##  $ thumbnail_url         : Factor w/ 17854 levels "https://a0.muscache.com/im/pictures/0000ec1e-e6de-4c4b-b244-be89b344267b.jpg?aki_policy=small",..: 4461 8985 2835 16641 7300 516 15027 10440 4453 2561 ...
##  $ zipcode               : Factor w/ 260 levels "10001","10001.0",..: 29 38 14 180 155 231 222 25 177 154 ...
##  $ bedrooms              : num  3 1 1 3 1 3 2 1 1 1 ...
##  $ beds                  : num  3 3 1 3 1 3 2 2 1 1 ...

The log_price and beds variables were filtered to only include rows with values greater than 0. The numeric variables were binned, and different binning strategies were tested against the model accuracy. For example, log_price was initially binned using the variable’s summary statistics, but this created skewed bins.

#Convert num var to factors 
#ignoring ID b/c it's just an identifier for the apartment and shouldn't have an effect on the price 
#ignoring latitude and longitude since factoring zip code & neighborhood

#convert log_price to factors 
summary(nyc_naive.df$log_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.248   4.654   4.719   5.165   7.600
price <- nyc_naive.df %>%filter(log_price <1)
price
##         id log_price property_type   room_type
## 1 17972519         0   Condominium Shared room
##                                                                                                                                                                               amenities
## 1 {TV,"Wireless Internet","Air conditioning",Kitchen,Elevator,"Indoor fireplace","Buzzer/wireless intercom",Heating,"Smoke detector","Carbon monoxide detector",Iron,"Host greets you"}
##   accommodates bathrooms bed_type cancellation_policy cleaning_fee city
## 1            1         1 Real Bed              strict         True  NYC
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                description
## 1 Comfortable dwellings .. My comfy home is for guest travelers and college students on a budget..!! I'm a 3min walk to the 6 train which is 30 mins away from Grand Central Station, Times Square, 34th St and a lot of other tourist areas..  Most of my neighbours are homeowners and are friendly and keep to themselves.. Any questions feel free to ponder then ask.. Comfy dwellings that will make you feel at home.. 40" Mounted TV with Amazon Fire-Stick including Netflix, Hulu, Youtube, Music etc. Access to the kitchen, bathroom and livingroom.. Microwave, toaster, utensils and cutlery to accommodate your needs.. Fresh towels and wash cloths provided.. My cohost and I will provide assistance as needed throughout your stay.. My neighbourhood is located in East-Central part of the Bronx which is 30 mins from Midtown Manhattan.. There is the convenience of several local bus lines that serve my neighbourhood and the 6 train that serves the community..  Macy's, Applebees, IHOP, department stores as
##   first_review host_has_profile_pic host_identity_verified host_response_rate
## 1   2017-08-17                    t                      f             always
##   host_since instant_bookable last_review latitude longitude           name
## 1 2017-07-20                t  2017-10-01 40.83805 -73.85867 Esteem's Place
##   neighbourhood number_of_reviews review_scores_rating
## 1   Parkchester                 3                   87
##                                                                                   thumbnail_url
## 1 https://a0.muscache.com/im/pictures/71bc085f-b869-4cda-8b4f-dca694ebbd26.jpg?aki_policy=small
##   zipcode bedrooms beds
## 1   10462        1    1
beds <-nyc_naive.df %>%filter(beds <1)
beds
##  [1] id                     log_price              property_type         
##  [4] room_type              amenities              accommodates          
##  [7] bathrooms              bed_type               cancellation_policy   
## [10] cleaning_fee           city                   description           
## [13] first_review           host_has_profile_pic   host_identity_verified
## [16] host_response_rate     host_since             instant_bookable      
## [19] last_review            latitude               longitude             
## [22] name                   neighbourhood          number_of_reviews     
## [25] review_scores_rating   thumbnail_url          zipcode               
## [28] bedrooms               beds                  
## <0 rows> (or 0-length row.names)
summary(nyc_data_clean$log_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   4.248   4.654   4.719   5.165   7.600
#filter to remove one row with price of 0 and 0 for beds 
nyc_naive.df <- nyc_naive.df %>% filter(log_price > 0)
nyc_naive.df <- nyc_naive.df%>% filter(beds > 0)
summary(nyc_naive.df$log_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.609   4.248   4.654   4.720   5.165   7.600
#table(nyc_naive.df$log_price)
nyc_naive.df$log_price<-cut(nyc_naive.df$log_price,breaks=
                              quantile(nyc_naive.df$log_price,
                                                  probs=seq(0,1,by=0.25),
                                                  na.rm=TRUE),labels=c("Student Budget","Below Average","Above Average", "Pricey Digs"),include.lowest=TRUE)


#initial binning using summary() stats 
#nyc_naive.df$log_price<- cut(nyc_naive.df$
                               #log_price,breaks =c(1,4.248,4.654,5.165,8),
                               #labels=c("Student Budget","Below Average",
                                       #"Above Average", "Pricey Digs"),
                               #include.lowest=TRUE)

summary(nyc_naive.df$log_price)
## Student Budget  Below Average  Above Average    Pricey Digs 
##           4767           4177           4672           4239
#convert accommodates to factors 
summary(nyc_naive.df$accommodates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   3.034   4.000  16.000
#table(nyc_naive.df$accommodates)

#orig binning based off summary()stats
#nyc_naive.df$accommodates<- cut(nyc_naive.df$accommodates,breaks =c(1,2,3,4,16),
                       #labels=c("studio", "small", "medium", "large"),
                       #include.lowest=TRUE)

nyc_naive.df$accommodates <- cut2(nyc_naive.df$accommodates, g = 5)
levels(nyc_naive.df$accommodates) <- c("studio", "small", "med", "large")
summary(nyc_naive.df$accommodates)
## studio  small    med  large 
##  10302   2059   2772   2722
#convert bathrooms to factors
#orig binning attempt based on summary() stats
summary(nyc_naive.df$bathrooms)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   1.000   1.137   1.000   6.000
#nyc_naive.df$bathrooms<- cut(nyc_naive.df$bathrooms,breaks =c(-1,1,2.5,8),
                          #labels=c("one or less","two", "multiple"),
                          #include.lowest=TRUE)

nyc_naive.df$bathrooms <- cut2(nyc_naive.df$bathrooms, g = 2)
#levels(nyc_naive.df$bathrooms)
levels(nyc_naive.df$bathrooms) <- c("one or less", "multiple")
summary(nyc_naive.df$bathrooms)
## one or less    multiple 
##       15397        2458
#convert reviews to factors 
summary(nyc_naive.df$number_of_reviews)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00   14.00   28.76   37.00  465.00
table(nyc_naive.df$review_scores_rating)
## 
##   20   27   40   47   50   53   55   56   57   60   62   63   64   65   66   67 
##   25    1   17    2    6    1    1    1    2  109    1    2    3   10    1   16 
##   68   69   70   71   72   73   74   75   76   77   78   79   80   81   82   83 
##    7    2   46    3    9   45   16   40   25   28   36   33  529   55   81  140 
##   84   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99 
##  169  229  232  386  418  467  920  635  820 1279 1023 1215 1330 1252 1232  661 
##  100 
## 4294
nyc_naive.df$number_of_reviews<- cut(nyc_naive.df$number_of_reviews,breaks =c(0,1,5,20,474),
                       labels=c("none","low", "average", "high"),
                       include.lowest=TRUE)
summary(nyc_naive.df$number_of_reviews)
##    none     low average    high 
##    1351    3904    5525    7075
#convert ratings to factors 
summary(nyc_naive.df$review_scores_rating)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   20.00   91.00   95.00   93.76   99.00  100.00
nyc_naive.df$review_scores_rating<- cut(nyc_naive.df$review_scores_rating,breaks =c(20,91,96,100),
                               labels=c("low","high", "top"),
                               include.lowest=TRUE)

#tried binning this way but received an error 
#nyc_naive.df$review_scores_rating <- cut2(nyc_naive.df$review_scores_rating, g = 3)
#levels(nyc_naive.df$review_scores_rating) <- c("low", "high", "top")
#summary(nyc_naive.df$review_scores_rating)

#bedrooms to factors 
#summary(nyc_naive.df$bedrooms)
nyc_naive.df$bedrooms<- cut(nyc_naive.df$bedrooms,breaks =c(-1,0,1,10),
                                  labels=c("studio","one", "multiple"),
                                  include.lowest=TRUE)

summary(nyc_naive.df$beds)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   1.667   2.000  18.000
nyc_naive.df$beds<- cut(nyc_naive.df$beds,breaks =c(0,1,2,18),
                      labels=c("one","two","many"),
                      include.lowest=TRUE)
summary(nyc_naive.df$beds)
##   one   two  many 
## 11123  3946  2786

The logical variables in the data set were converted to factors as well.

#turn log var to factors 
nyc_naive.df$host_identity_verified<-as.factor(nyc_naive.df$
                                                  host_identity_verified)
nyc_naive.df$cleaning_fee <- as.factor(nyc_naive.df$cleaning_fee)
nyc_naive.df$host_has_profile_pic <- as.factor(nyc_naive.df$
                                                 host_has_profile_pic)
nyc_naive.df$instant_bookable <- as.factor(nyc_naive.df$instant_bookable)

D. Using any five predictors, build a model using the naive Bayes algorithm. Describe a fictional apartment, and use your model to predict which bin it will fall into.

The dataset was partitioned into a training data frame with 60% of the data and a validation data frame with 40% of the data.

#split data 60/40 
set.seed(1)
train_2.index <- sample(rownames(nyc_naive.df),dim(nyc_naive.df)[1]*0.6)
valid_2.index <- setdiff(rownames(nyc_naive.df), train_2.index)
train_2.df <- nyc_naive.df[train_2.index, ]
valid_2.df <- nyc_naive.df[valid_2.index, ]

Bar plots were used to explore the relationships between predictor variables and price categories, providing insight into which predictors were most relevant for inclusion in the Naive Bayes model.

#View bar plots to view relationships between price 

ggplot(train_2.df,aes(x=room_type, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by Room Type", y = "Proportion" ) +
  theme_minimal()

#room type seems to have an effect on price 

Room type seems to have an effect on the price of a listing. For example, entire homes and apartments have more pricey listings than private and shared rooms, while student budget and below average listings are concentrated among private and shared rooms.

ggplot(train_2.df,aes(x=accommodates, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by Accommodates", y = "Proportion" ) +
  theme_minimal()

#The amount of people seems to have an effect on price 

The number of people a listing can accommodate also seems to have an effect on the price: listings that can accommodate more people tend to be more expensive than those with lower capacity.

ggplot(train_2.df,aes(x=bathrooms, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by # of Bathrooms", y = "Proportion" ) +
  theme_minimal()

#Number of bathrooms seems to have less effect on price than other variables

The number of bathrooms in a listing appears to have less influence on price compared to other variables. Interestingly, listings with one or fewer bathrooms show a higher proportion of above average price listings than those with multiple bathrooms. This may be explained by the greater representation of one bathroom listings in the dataset, which could be skewing the observed relationship.

ggplot(train_2.df,aes(x=host_identity_verified, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by Host Verification", y = "Proportion" ) +
  theme_minimal()

#host verification doesn't seem to have much effect on price

Host identity verification doesn’t seem to have much of an effect on listing prices. All price levels are proportioned almost equally for verified and unverified hosts, so host verification was not chosen for the Naive Bayes model.

ggplot(train_2.df,aes(x=instant_bookable, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by Booking Option", y = "Proportion" ) +
  theme_minimal()

#book option doesn't seem to have much effect on price 

Booking options also appear to have minimal impact on listing price. All price categories show similar proportions across the different booking options, suggesting that this variable does not strongly influence pricing.

ggplot(train_2.df,aes(x=review_scores_rating, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by review ratings", y = "Proportion" ) +
  theme_minimal()

#review scores don't seem to have much effect on price

Review ratings also seem to have a low impact on the listing price.

ggplot(train_2.df,aes(x=bedrooms, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by # of Bedrooms", y = "Proportion" ) +
  theme_minimal()

#number bedrooms seems to have an effect on price 

The number of bedrooms in a listing appears to have a stronger influence on price compared to other variables. For instance, listings with multiple bedrooms are less commonly found in the student budget category, whereas one bedroom listings are more prevalent. Interestingly, studio apartments do not show as much representation in the student budget category.

ggplot(train_2.df,aes(x=beds, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by # of Beds", y = "Proportion" ) +
  theme_minimal()

#number of beds seems to have an effect on price

The number of beds also seems to have a larger impact on the price of a listing than most other variables.

ggplot(train_2.df,aes(x=cancellation_policy, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by cancellation policy", y = "Proportion" ) +
  theme_minimal()

#cancellation policy seems to have small effect on price 

The cancellation policy has a smaller impact on the listing price compared to other variables.

ggplot(train_2.df,aes(x=cleaning_fee, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by cleaning fee", y = "Proportion" ) +
  theme_minimal()

#cleaning fee seems to have a small effect on price 

The cleaning fee seems to have a lower impact on the price of a listing compared to other variables.

ggplot(train_2.df,aes(x=host_response_rate, fill = log_price))+
  geom_bar(position="fill") +
  labs(title="Price by host response rate", y = "Proportion" ) +
  theme_minimal()

#host response rate seems to not have an effect on price 

Host response rate doesn’t seem to have much of an impact on the price of a listing.

E. Show a screenshot of the code you used to build your model, the code you used to run the algorithm, and code you used to assess the algorithm.

#selecting five variables 
train_2.df <- select(train_2.df,log_price,room_type,accommodates
                     ,bathrooms,bedrooms,cancellation_policy)
valid_2.df <-select(valid_2.df,log_price,room_type,accommodates
                    , bathrooms,bedrooms,cancellation_policy)
#initially chose room_type, review_scores_rating,
#bathrooms, beds, cleaning_fee: 49% accuracy 

#room_type, review_scores_rating, bathrooms, bedrooms, cleaning_fee: 52% accuracy

#room_type, review_scores_rating, bathrooms, bedrooms, cancellation_policy: 52%

#room_type, accommodates, bathrooms, bedrooms, cancellation_policy: 52%
#check 
#check 
names(train_2.df)
## [1] "log_price"           "room_type"           "accommodates"       
## [4] "bathrooms"           "bedrooms"            "cancellation_policy"
names(valid_2.df)
## [1] "log_price"           "room_type"           "accommodates"       
## [4] "bathrooms"           "bedrooms"            "cancellation_policy"
#run naive bayes 
price.nb <- naiveBayes(log_price ~ ., data = train_2.df)
price.nb
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Student Budget  Below Average  Above Average    Pricey Digs 
##      0.2648184      0.2356950      0.2577243      0.2417623 
## 
## Conditional probabilities:
##                 room_type
## Y                Entire home/apt Private room  Shared room
##   Student Budget    0.0440606274 0.8829749736 0.0729643990
##   Below Average     0.3120792079 0.6689108911 0.0190099010
##   Above Average     0.7547989859 0.2404925752 0.0047084390
##   Pricey Digs       0.9528957529 0.0467181467 0.0003861004
## 
##                 accommodates
## Y                    studio      small        med      large
##   Student Budget 0.86253084 0.08248149 0.04194572 0.01304195
##   Below Average  0.73940594 0.10613861 0.11326733 0.04118812
##   Above Average  0.47736327 0.15682724 0.22600507 0.13980442
##   Pricey Digs    0.20579151 0.12123552 0.26061776 0.41235521
## 
##                 bathrooms
## Y                one or less   multiple
##   Student Budget  0.85618611 0.14381389
##   Below Average   0.91247525 0.08752475
##   Above Average   0.93734154 0.06265846
##   Pricey Digs     0.74092664 0.25907336
## 
##                 bedrooms
## Y                    studio        one   multiple
##   Student Budget 0.02502644 0.94888967 0.02608389
##   Below Average  0.08118812 0.84633663 0.07247525
##   Above Average  0.14885911 0.65447302 0.19666787
##   Pricey Digs    0.09459459 0.39266409 0.51274131
## 
##                 cancellation_policy
## Y                 flexible  moderate    strict
##   Student Budget 0.2428622 0.2759958 0.4811421
##   Below Average  0.2344554 0.2942574 0.4712871
##   Above Average  0.1647954 0.2966317 0.5385730
##   Pricey Digs    0.1135135 0.2154440 0.6710425
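Question D also asks for a prediction on a fictional apartment, which could be done with the price.nb model above. A sketch (the listing described here is invented for illustration; the factor levels must match those used in training):

```r
# Fictional listing: an entire home/apt with multiple bedrooms and bathrooms
# that accommodates a large group, under a strict cancellation policy.
fictional <- data.frame(
  room_type = factor("Entire home/apt",
                     levels = c("Entire home/apt", "Private room", "Shared room")),
  accommodates = factor("large", levels = c("studio", "small", "med", "large")),
  bathrooms = factor("multiple", levels = c("one or less", "multiple")),
  bedrooms = factor("multiple", levels = c("studio", "one", "multiple")),
  cancellation_policy = factor("strict", levels = c("flexible", "moderate", "strict"))
)
predict(price.nb, newdata = fictional)                # predicted price bin
predict(price.nb, newdata = fictional, type = "raw")  # class probabilities
# Given the conditional probabilities printed above, this profile should
# land squarely in "Pricey Digs".
```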
#training data 
confusionMatrix(predict(price.nb,newdata=train_2.df),train_2.df$log_price)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Student Budget Below Average Above Average Pricey Digs
##   Student Budget           2544          1596           584          85
##   Below Average             147           172           145          66
##   Above Average             121           578          1418        1000
##   Pricey Digs                25           179           614        1439
## 
## Overall Statistics
##                                                
##                Accuracy : 0.5202               
##                  95% CI : (0.5107, 0.5297)     
##     No Information Rate : 0.2648               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3547               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: Student Budget Class: Below Average
## Sensitivity                         0.8967              0.06812
## Specificity                         0.7124              0.95628
## Pos Pred Value                      0.5290              0.32453
## Neg Pred Value                      0.9504              0.76893
## Prevalence                          0.2648              0.23569
## Detection Rate                      0.2375              0.01606
## Detection Prevalence                0.4489              0.04947
## Balanced Accuracy                   0.8046              0.51220
##                      Class: Above Average Class: Pricey Digs
## Sensitivity                        0.5136             0.5556
## Specificity                        0.7863             0.8993
## Pos Pred Value                     0.4549             0.6376
## Neg Pred Value                     0.8232             0.8639
## Prevalence                         0.2577             0.2418
## Detection Rate                     0.1324             0.1343
## Detection Prevalence               0.2910             0.2107
## Balanced Accuracy                  0.6500             0.7274
# holdout data 
confusionMatrix(predict(price.nb, newdata=valid_2.df), valid_2.df$log_price)
## Confusion Matrix and Statistics
## 
##                 Reference
## Prediction       Student Budget Below Average Above Average Pricey Digs
##   Student Budget           1710          1044           388          61
##   Below Average              98           103            99          43
##   Above Average              98           382           972         615
##   Pricey Digs                24           123           452         930
## 
## Overall Statistics
##                                                
##                Accuracy : 0.5202               
##                  95% CI : (0.5085, 0.5318)     
##     No Information Rate : 0.2702               
##     P-Value [Acc > NIR] : < 0.00000000000000022
##                                                
##                   Kappa : 0.3523               
##                                                
##  Mcnemar's Test P-Value : < 0.00000000000000022
## 
## Statistics by Class:
## 
##                      Class: Student Budget Class: Below Average
## Sensitivity                         0.8860              0.06235
## Specificity                         0.7135              0.95628
## Pos Pred Value                      0.5339              0.30029
## Neg Pred Value                      0.9441              0.77217
## Prevalence                          0.2702              0.23131
## Detection Rate                      0.2394              0.01442
## Detection Prevalence                0.4485              0.04803
## Balanced Accuracy                   0.7998              0.50932
##                      Class: Above Average Class: Pricey Digs
## Sensitivity                        0.5086             0.5640
## Specificity                        0.7907             0.8910
## Pos Pred Value                     0.4702             0.6082
## Neg Pred Value                     0.8150             0.8719
## Prevalence                         0.2676             0.2309
## Detection Rate                     0.1361             0.1302
## Detection Prevalence               0.2894             0.2141
## Balanced Accuracy                  0.6497             0.7275
price.nb$tables$room_type
##                 room_type
## Y                Entire home/apt Private room  Shared room
##   Student Budget    0.0440606274 0.8829749736 0.0729643990
##   Below Average     0.3120792079 0.6689108911 0.0190099010
##   Above Average     0.7547989859 0.2404925752 0.0047084390
##   Pricey Digs       0.9528957529 0.0467181467 0.0003861004
## predict probabilities
pred.prob <- predict(price.nb, newdata=valid_2.df, type="raw")
## predict class membership
pred.class <- predict(price.nb, newdata=valid_2.df)

df_2 <- data.frame(actual=valid_2.df$log_price, predicted=pred.class, pred.prob)

test_apt <- data.frame(cancellation_policy = "moderate",
                       accommodates = "med",  # bin labels are studio/small/med/large
                       room_type = "Entire home/apt",
                       bathrooms = "multiple",
                       bedrooms = "multiple")

pred_prob <- predict(price.nb, newdata=test_apt, type="raw")
pred_class <- predict(price.nb, newdata = test_apt)
pred_prob
##      Student Budget Below Average Above Average Pricey Digs
## [1,]    0.001620732    0.01842043    0.09540119   0.8845576
pred_class
## [1] Pricey Digs
## Levels: Student Budget Below Average Above Average Pricey Digs
# predicted class: Pricey Digs
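The predicted class follows directly from the tables printed above: each class prior is multiplied by the matching conditional probabilities, and the products are normalized. A minimal sketch of that arithmetic for a single predictor, using the priors and the room_type = "Entire home/apt" column copied from the model output:

```r
# Priors and P(room_type = "Entire home/apt" | class), copied from the model
# output above, in the order: Student Budget, Below Average, Above Average,
# Pricey Digs.
priors <- c(0.2648184, 0.2356950, 0.2577243, 0.2417623)
cond   <- c(0.0440606, 0.3120792, 0.7547990, 0.9528958)

# Unnormalized posterior = prior * likelihood; normalize so it sums to 1.
posterior <- priors * cond / sum(priors * cond)
names(posterior) <- c("Student Budget", "Below Average",
                      "Above Average", "Pricey Digs")
round(posterior, 3)
# An entire home/apt alone already shifts most of the posterior mass to the
# two higher-priced classes.
```

With more predictors, predict() multiplies in one conditional per table before normalizing, which is how the test listing above ends up in "Pricey Digs."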

Lift charts were used to evaluate the model’s performance relative to random guessing. The results indicate that, for each price category, the model consistently outperforms random selection.

#Lift chart for Pricey Digs 


actual_pricey <- factor(ifelse(valid_2.df$log_price == "Pricey Digs", "yes", "no"))
prob_pricey <- pred.prob[, "Pricey Digs"]
lift_pricey <- lift(actual_pricey ~ prob_pricey,
                    data = data.frame(actual_pricey, prob_pricey), class = "yes")

plot(lift_pricey, main = "Lift Chart for Predicting Pricey Digs")

#lift chart for below average

actual_below <- factor(ifelse(valid_2.df$log_price == "Below Average", "yes", "no"))
prob_below <- pred.prob[, "Below Average"]
lift_below <- lift(actual_below ~ prob_below,
                   data = data.frame(actual_below, prob_below), class = "yes")

plot(lift_below, main = "Lift Chart for Predicting Below Average")

#lift chart for student budget

actual_budget <- factor(ifelse(valid_2.df$log_price == "Student Budget", "yes", "no"))
prob_budget <- pred.prob[, "Student Budget"]
lift_budget <- lift(actual_budget ~ prob_budget,
                    data = data.frame(actual_budget, prob_budget), class = "yes")

plot(lift_budget, main = "Lift Chart for Predicting Student Budget")

#lift chart for above average

actual_above <- factor(ifelse(valid_2.df$log_price == "Above Average", "yes", "no"))
prob_above <- pred.prob[, "Above Average"]
lift_above <- lift(actual_above ~ prob_above,
                   data = data.frame(actual_above, prob_above), class = "yes")

plot(lift_above, main = "Lift Chart for Predicting Above Average")
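The four lift-chart blocks above differ only in the class name, so they can be collapsed into a helper. The sketch below is self-contained: class_lift() is a hypothetical wrapper, and the demo matrix merely stands in for pred.prob and valid_2.df$log_price.

```r
library(caret)  # provides lift()

# Hypothetical helper: build a caret lift object for one class, given a
# posterior-probability matrix and the actual class labels.
class_lift <- function(prob_matrix, actual_labels, cls) {
  actual <- factor(ifelse(actual_labels == cls, "yes", "no"),
                   levels = c("no", "yes"))
  prob <- prob_matrix[, cls]
  lift(actual ~ prob, data = data.frame(actual, prob), class = "yes")
}

# Synthetic stand-in for pred.prob and the holdout labels.
set.seed(42)
demo_prob <- matrix(runif(200), ncol = 2,
                    dimnames = list(NULL, c("Cheap", "Pricey")))
demo_prob <- demo_prob / rowSums(demo_prob)  # rows sum to 1, like posteriors
demo_actual <- ifelse(demo_prob[, "Pricey"] + rnorm(100, sd = 0.3) > 0.5,
                      "Pricey", "Cheap")

plot(class_lift(demo_prob, demo_actual, "Pricey"),
     main = "Lift Chart for Predicting Pricey")
```

In the setting above the equivalent call would be class_lift(pred.prob, valid_2.df$log_price, "Pricey Digs"), looped over colnames(pred.prob).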

F. Write a two-paragraph narrative that describes how you did this. In your narrative, be sure to talk about things like factor selection and testing against your training data.

A Naive Bayes model was built to classify listings into four categories: Student Budget, Below Average, Above Average, and Pricey Digs. Character and logical variables were converted to factors, and numeric variables were binned either by their summary statistics or by equal-frequency binning. Because the log_price distribution was skewed, it, too, was binned by equal frequency. The five predictors selected for the final model were room type, accommodates, bathrooms, bedrooms, and cancellation policy.
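Equal-frequency binning, as used for log_price here, amounts to cutting at quantile breakpoints. A self-contained sketch on synthetic log prices (the data are made up; only the four bin labels come from the model above):

```r
set.seed(1)
log_price <- rnorm(1000, mean = 4.8, sd = 0.7)  # synthetic log prices

# Quartile breakpoints produce four bins with (roughly) equal counts.
breaks <- quantile(log_price, probs = seq(0, 1, 0.25))
price_bin <- cut(log_price, breaks = breaks, include.lowest = TRUE,
                 labels = c("Student Budget", "Below Average",
                            "Above Average", "Pricey Digs"))
table(price_bin)  # roughly 250 listings per bin
```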

Despite experimenting with different binning techniques and variable combinations, the model's accuracy remained between 49% and 52%. However, lift chart analysis confirms that it still outperforms random guessing, and its accuracy on the validation set closely matched that on the training set, indicating stable performance. A fictional apartment listing was also scored, with the following attributes: room type Entire home/apt, multiple bathrooms, multiple bedrooms, accommodates in the medium bin, and a moderate cancellation policy. The model classified this listing as "Pricey Digs."

Part III. Classification Tree.

A. Build a classification tree that predicts the cancellation policy of an airbnb rental listing, with the outcome classes “flexible”, “moderate”, and “strict.”

#classification tree 
levels(nyc_data_clean$cancellation_policy)
## NULL
summary(nyc_data_clean$cancellation_policy)
##    Length     Class      Mode 
##     17856 character character
#str(nyc_data_clean)

# remove neighbourhood, thumbnail, zipcode, name, id, dates, description,
# latitude, longitude: too many levels, or free-text character variables
# factor bed_type, host_response_rate, property_type, room_type

nyc_tree.df <- nyc_data_clean %>%
  # Drop identifier, free-text, date, and location columns.
  select(-c(id,city,description,amenities,first_review,host_since,last_review,
            name,neighbourhood,thumbnail_url,zipcode, latitude, longitude)) 
#str(nyc_tree.df)
#table(nyc_tree.df$host_response_rate)
nyc_tree.df$host_response_rate <- as.numeric(gsub("%", "", nyc_tree.df$host_response_rate))
nyc_tree.df$host_response_rate <- cut(nyc_tree.df$host_response_rate,
                                      breaks = c(0, 50, 75, 100),
                                      labels = c("rarely", "sometimes", "always"),
                                      include.lowest = TRUE)
summary(nyc_tree.df$host_response_rate)
##    rarely sometimes    always 
##       442       711     16703
#bed_type,room_type,property_type from chr to factor 
nyc_tree.df$bed_type<-as.factor(nyc_tree.df$bed_type)

#table(nyc_tree.df$room_type)
nyc_tree.df$room_type<-as.factor(nyc_tree.df$room_type)

#table(nyc_tree.df$property_type)
nyc_tree.df$property_type<-as.factor(nyc_tree.df$property_type)


#bin cancellation policy to: flexible, moderate, strict 
#nyc_tree.df$cancellation_policy <-as.character(nyc_tree.df$cancellation_policy)

nyc_tree.df$cancellation_policy[nyc_tree.df$cancellation_policy %in% c("super_strict_30", "super_strict_60")] <- "strict"
nyc_tree.df$cancellation_policy<-as.factor(nyc_tree.df$cancellation_policy)

# convert logical variables to factors
nyc_tree.df$host_identity_verified <- as.factor(nyc_tree.df$host_identity_verified)
nyc_tree.df$cleaning_fee <- as.factor(nyc_tree.df$cleaning_fee)
nyc_tree.df$host_has_profile_pic <- as.factor(nyc_tree.df$host_has_profile_pic)
nyc_tree.df$instant_bookable <- as.factor(nyc_tree.df$instant_bookable)
#Partition data 
str(nyc_tree.df)
## 'data.frame':    17856 obs. of  16 variables:
##  $ log_price             : num  5.13 4.98 4.61 4.6 3.69 ...
##  $ property_type         : Factor w/ 24 levels "Apartment","Bed & Breakfast",..: 1 1 1 15 15 1 1 1 1 1 ...
##  $ room_type             : Factor w/ 3 levels "Entire home/apt",..: 1 1 2 1 2 1 1 2 2 2 ...
##  $ accommodates          : int  7 5 2 8 2 6 4 3 2 2 ...
##  $ bathrooms             : num  1 1 1 1 1.5 1 1 1 1 1 ...
##  $ bed_type              : Factor w/ 5 levels "Airbed","Couch",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ cancellation_policy   : Factor w/ 3 levels "flexible","moderate",..: 3 2 3 3 3 3 3 3 3 2 ...
##  $ cleaning_fee          : Factor w/ 2 levels "False","True": 2 2 2 2 2 2 2 2 2 2 ...
##  $ host_has_profile_pic  : Factor w/ 2 levels "f","t": 2 2 2 2 2 2 2 2 2 2 ...
##  $ host_identity_verified: Factor w/ 2 levels "f","t": 1 2 2 1 2 2 1 1 2 2 ...
##  $ host_response_rate    : Factor w/ 3 levels "rarely","sometimes",..: 3 3 3 3 3 3 3 2 2 3 ...
##  $ instant_bookable      : Factor w/ 2 levels "f","t": 2 2 1 2 1 1 1 1 1 2 ...
##  $ number_of_reviews     : int  6 10 82 5 57 14 4 40 1 17 ...
##  $ review_scores_rating  : num  93 92 93 72 89 100 100 88 100 96 ...
##  $ bedrooms              : num  3 1 1 3 1 3 2 1 1 1 ...
##  $ beds                  : num  3 3 1 3 1 3 2 2 1 1 ...
summary(nyc_tree.df$cancellation_policy)
## flexible moderate   strict 
##     3333     4784     9739
set.seed(1)

idx <- createDataPartition(nyc_tree.df$cancellation_policy, p=0.6, list=FALSE)
train_tree.df <- nyc_tree.df[idx, ]
holdout_tree.df <- nyc_tree.df[-idx, ]

B. Determine the ideal size of your tree using cross-validation.

#tree 1
model1 <- rpart(cancellation_policy ~ ., data = train_tree.df,
                cp = 0.001, method = "class")

# plot tree
rpart.plot(model1, extra=1, fallen.leaves=FALSE)

# second plotting option for model1
#rpart.plot(model1, extra = 104,
           #branch.lty = 4, shadow.col = "gray", nn = TRUE)

Model 1 used the smallest cp value, but it has many nodes, potentially making it more complex than needed and prone to overfitting.

#tree 2 
model2 <- rpart(cancellation_policy ~ ., data = train_tree.df, xval = 5,
                cp = 0.0024636, method = "class")

rpart.plot(model2, extra=1, fallen.leaves=FALSE)

#2nd option to plot model2
#rpart.plot(model2, extra = 104,
           #branch.lty = 4, shadow.col = "gray", nn = TRUE)

The classification tree plotted from model2 doesn’t include all of the outcome variable’s categories.

model3 <- rpart(cancellation_policy ~ ., data = train_tree.df, xval = 5,
                cp = 0.0018477, method = "class")

rpart.plot(model3, extra=1, fallen.leaves=FALSE)

Compared to Model 1, Model 3 was simpler in structure. While it captured all three categories of the outcome variable, this separation occurred only at the last split.
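rpart also records a cross-validated error (xerror) for every candidate cp in its cptable, which offers another way to size the tree before reaching for caret. A self-contained sketch on the built-in iris data (not the Airbnb data), applying the common one-standard-error rule:

```r
library(rpart)

# Grow a deliberately deep tree, then read rpart's cross-validation table.
set.seed(1)
fit <- rpart(Species ~ ., data = iris, method = "class",
             cp = 0.0001, xval = 5)
printcp(fit)  # CP, nsplit, rel error, xerror, xstd per candidate subtree

# One-SE rule: simplest subtree whose xerror is within one standard error
# of the minimum cross-validated error.
cptab <- fit$cptable
best <- which.min(cptab[, "xerror"])
threshold <- cptab[best, "xerror"] + cptab[best, "xstd"]
chosen_cp <- cptab[min(which(cptab[, "xerror"] <= threshold)), "CP"]
pruned <- prune(fit, cp = chosen_cp)
```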

#hyperparameter tuning 

control <- trainControl(method = "repeatedcv", number = 7, search = "grid",
                        classProbs = TRUE, summaryFunction = multiClassSummary)

tunegrid <- expand.grid(cp = seq(from = 0.0001, to = 0.01, by = 0.001))
dt_gridsearch <- train(cancellation_policy ~ ., data = train_tree.df,
                       method = "rpart", metric = "Accuracy",
                       tuneGrid = tunegrid, trControl = control)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(dt_gridsearch)
## CART 
## 
## 10715 samples
##    15 predictor
##     3 classes: 'flexible', 'moderate', 'strict' 
## 
## No pre-processing
## Resampling: Cross-Validated (7 fold, repeated 1 times) 
## Summary of sample sizes: 9184, 9184, 9185, 9185, 9184, 9184, ... 
## Resampling results across tuning parameters:
## 
##   cp      logLoss    AUC        prAUC      Accuracy   Kappa      Mean_F1  
##   0.0001  1.9852869  0.6249356  0.4400766  0.5113402  0.1422480  0.4313485
##   0.0011  0.9688574  0.5771863  0.3692732  0.5705098  0.1353342  0.3821414
##   0.0021  0.9608344  0.5765844  0.2984229  0.5691096  0.1297495  0.3828543
##   0.0031  0.9612210  0.5767632  0.1915386  0.5691096  0.1197092  0.3949533
##   0.0041  0.9619828  0.5764966  0.1905609  0.5678966  0.1195936  0.3949533
##   0.0051  0.9622986  0.5755780  0.1869293  0.5683634  0.1205048        NaN
##   0.0061  0.9622986  0.5755780  0.1869293  0.5683634  0.1205048        NaN
##   0.0071  0.9622986  0.5755780  0.1869293  0.5683634  0.1205048        NaN
##   0.0081  0.9622986  0.5755780  0.1869293  0.5683634  0.1205048        NaN
##   0.0091  0.9622986  0.5755780  0.1869293  0.5683634  0.1205048        NaN
##   Mean_Sensitivity  Mean_Specificity  Mean_Pos_Pred_Value  Mean_Neg_Pred_Value
##   0.4287626         0.7125465         0.4415810            0.7183194          
##   0.4157701         0.7043482         0.5071865            0.7653037          
##   0.4114668         0.7029044         0.4877401            0.7650505          
##   0.4065537         0.6993735         0.4907577            0.7687927          
##   0.4070165         0.6993289         0.4907577            0.7647393          
##   0.4081123         0.6994388               NaN            0.7654225          
##   0.4081123         0.6994388               NaN            0.7654225          
##   0.4081123         0.6994388               NaN            0.7654225          
##   0.4081123         0.6994388               NaN            0.7654225          
##   0.4081123         0.6994388               NaN            0.7654225          
##   Mean_Precision  Mean_Recall  Mean_Detection_Rate  Mean_Balanced_Accuracy
##   0.4415810       0.4287626    0.1704467            0.5706545             
##   0.5071865       0.4157701    0.1901699            0.5600591             
##   0.4877401       0.4114668    0.1897032            0.5571856             
##   0.4907577       0.4065537    0.1897032            0.5529636             
##   0.4907577       0.4070165    0.1892989            0.5531727             
##         NaN       0.4081123    0.1894545            0.5537755             
##         NaN       0.4081123    0.1894545            0.5537755             
##         NaN       0.4081123    0.1894545            0.5537755             
##         NaN       0.4081123    0.1894545            0.5537755             
##         NaN       0.4081123    0.1894545            0.5537755             
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0011.
plot(dt_gridsearch)

dt_gridsearch$results
##        cp   logLoss       AUC     prAUC  Accuracy     Kappa   Mean_F1
## 1  0.0001 1.9852869 0.6249356 0.4400766 0.5113402 0.1422480 0.4313485
## 2  0.0011 0.9688574 0.5771863 0.3692732 0.5705098 0.1353342 0.3821414
## 3  0.0021 0.9608344 0.5765844 0.2984229 0.5691096 0.1297495 0.3828543
## 4  0.0031 0.9612210 0.5767632 0.1915386 0.5691096 0.1197092 0.3949533
## 5  0.0041 0.9619828 0.5764966 0.1905609 0.5678966 0.1195936 0.3949533
## 6  0.0051 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048       NaN
## 7  0.0061 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048       NaN
## 8  0.0071 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048       NaN
## 9  0.0081 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048       NaN
## 10 0.0091 0.9622986 0.5755780 0.1869293 0.5683634 0.1205048       NaN
##    Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
## 1         0.4287626        0.7125465           0.4415810           0.7183194
## 2         0.4157701        0.7043482           0.5071865           0.7653037
## 3         0.4114668        0.7029044           0.4877401           0.7650505
## 4         0.4065537        0.6993735           0.4907577           0.7687927
## 5         0.4070165        0.6993289           0.4907577           0.7647393
## 6         0.4081123        0.6994388                 NaN           0.7654225
## 7         0.4081123        0.6994388                 NaN           0.7654225
## 8         0.4081123        0.6994388                 NaN           0.7654225
## 9         0.4081123        0.6994388                 NaN           0.7654225
## 10        0.4081123        0.6994388                 NaN           0.7654225
##    Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
## 1       0.4415810   0.4287626           0.1704467              0.5706545
## 2       0.5071865   0.4157701           0.1901699              0.5600591
## 3       0.4877401   0.4114668           0.1897032              0.5571856
## 4       0.4907577   0.4065537           0.1897032              0.5529636
## 5       0.4907577   0.4070165           0.1892989              0.5531727
## 6             NaN   0.4081123           0.1894545              0.5537755
## 7             NaN   0.4081123           0.1894545              0.5537755
## 8             NaN   0.4081123           0.1894545              0.5537755
## 9             NaN   0.4081123           0.1894545              0.5537755
## 10            NaN   0.4081123           0.1894545              0.5537755
##      logLossSD       AUCSD     prAUCSD  AccuracySD    KappaSD   Mean_F1SD
## 1  0.146099665 0.005329671 0.006805516 0.010355840 0.01145506 0.009940498
## 2  0.020879851 0.008838694 0.059305894 0.005957622 0.01638366 0.010170023
## 3  0.008400957 0.009307469 0.086011836 0.005536329 0.01583959 0.008534250
## 4  0.008359484 0.008884793 0.006283672 0.005010466 0.01760796          NA
## 5  0.008675200 0.009083181 0.007087852 0.005743311 0.01795763          NA
## 6  0.008211788 0.008278361 0.006483607 0.006403378 0.01773355          NA
## 7  0.008211788 0.008278361 0.006483607 0.006403378 0.01773355          NA
## 8  0.008211788 0.008278361 0.006483607 0.006403378 0.01773355          NA
## 9  0.008211788 0.008278361 0.006483607 0.006403378 0.01773355          NA
## 10 0.008211788 0.008278361 0.006483607 0.006403378 0.01773355          NA
##    Mean_SensitivitySD Mean_SpecificitySD Mean_Pos_Pred_ValueSD
## 1         0.008001416        0.003447328           0.014474713
## 2         0.010388683        0.004784540           0.011348196
## 3         0.009945135        0.004812990           0.009424466
## 4         0.010959149        0.005288428                    NA
## 5         0.010911660        0.005405074                    NA
## 6         0.011612401        0.004974672                    NA
## 7         0.011612401        0.004974672                    NA
## 8         0.011612401        0.004974672                    NA
## 9         0.011612401        0.004974672                    NA
## 10        0.011612401        0.004974672                    NA
##    Mean_Neg_Pred_ValueSD Mean_PrecisionSD Mean_RecallSD Mean_Detection_RateSD
## 1            0.004665379      0.014474713   0.008001416           0.003451947
## 2            0.008788118      0.011348196   0.010388683           0.001985874
## 3            0.008477817      0.009424466   0.009945135           0.001845443
## 4            0.006664328               NA   0.010959149           0.001670155
## 5            0.008440020               NA   0.010911660           0.001914437
## 6            0.009256509               NA   0.011612401           0.002134459
## 7            0.009256509               NA   0.011612401           0.002134459
## 8            0.009256509               NA   0.011612401           0.002134459
## 9            0.009256509               NA   0.011612401           0.002134459
## 10           0.009256509               NA   0.011612401           0.002134459
##    Mean_Balanced_AccuracySD
## 1               0.005411302
## 2               0.007561411
## 3               0.007307516
## 4               0.008097106
## 5               0.008134632
## 6               0.008279338
## 7               0.008279338
## 8               0.008279338
## 9               0.008279338
## 10              0.008279338
confusionMatrix(dt_gridsearch)
## Cross-Validated (7 fold, repeated 1 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction flexible moderate strict
##   flexible      5.3      2.3    2.9
##   moderate      0.3      0.7    0.7
##   strict       13.0     23.8   51.0
##                             
##  Accuracy (average) : 0.5705
dt_gridsearch$bestTune
##       cp
## 2 0.0011
#0.0011 best CP value for classification tree 

Hyperparameter tuning was used to determine the most effective cp value for the classification tree. The final model had a cp value of 0.0011.

C. Using rpart.plot and your choice of graphical parameters, show your tree model here.

#final classification tree 
model4 <- rpart(cancellation_policy ~ ., data = train_tree.df, xval = 5,
                cp = 0.0011, method = "class")


rpart.plot(model4, extra=1, fallen.leaves=FALSE)

rpart.plot(model4, 
           type = 2, 
           extra = 104, 
           fallen.leaves = TRUE,
           cex = 0.5)    

# save the tree as a PNG because it is difficult to view inline
png("model4.png", width = 3000, height = 2000, res = 300)
rpart.plot(model4, type = 2, extra = 104, fallen.leaves = TRUE, cex = 0.5)
dev.off()
## quartz_off_screen 
##                 2

D. In a 1-2 paragraph write-up, describe your process. Talk about some of the features that you considered using, and your reasons why. Mention anything that you found interesting as you explored various possible models, and the process you used to arrive at the model you finished with.

A classification tree model was created to predict the classification of a listing’s cancellation policy. The cancellation policy was binned into three categories: strict, moderate, and flexible. Multiple trees were plotted using different cp values to get a sense of how the tree’s performance and complexity changed. Hyperparameter tuning was then used to determine the most effective cp value, which ended up being 0.0011. This resulted in a more complex tree with many nodes, but trees with higher cp values either excluded the moderate category or only included it at the very last split. This could be due to the moderate class being underrepresented in the data, which makes it harder for simpler trees to include it.

In the end, the model with a cp of 0.0011 was selected. While it’s complex, it is able to represent all three cancellation policy categories throughout the tree. The classification tree model’s average accuracy rate is 57%.


Part IV. Clustering.

I. Perform either a k-means analysis or a hierarchical clustering analysis to compare any selected list of neighborhoods in your city.

# Create a subset of neighborhoods

# View neighborhood names
unique_neighborhoods <- nyc_data %>%
  filter(!is.na(neighbourhood)) %>%
  distinct(neighbourhood) %>%
  pull(neighbourhood)

# Select n random neighborhoods (seed the RNG so the sample is reproducible)
set.seed(123)
n <- 60  # number of neighborhoods to sample
selected_neighborhoods <- sample(unique_neighborhoods, n)

# Drop NA values
nyc_data_clean <- nyc_data %>%
  drop_na()

# Create subset
nyc_subset <- nyc_data_clean %>%
  filter(neighbourhood %in% selected_neighborhoods)
# Feature Engineering and Summarization 
# Add a column for price
nyc_engineered <- nyc_subset %>%
  mutate(
    price = exp(log_price),
    price_per_person = price / accommodates,
    bedroom_to_guest_ratio = bedrooms / accommodates,
    has_cleaning_fee = ifelse(cleaning_fee == "True", 1, 0),
    is_verified_host = ifelse(host_identity_verified == "t", 1, 0)
  )
    
neighborhood_summary <- nyc_engineered %>%
  group_by(neighbourhood) %>%
  summarise(
    avg_price = mean(price, na.rm = TRUE),
    avg_beds = mean(beds, na.rm = TRUE),
    avg_reviews = mean(number_of_reviews, na.rm = TRUE),
    avg_review_score = mean(review_scores_rating, na.rm = TRUE),
    price_per_person = mean(price_per_person, na.rm = TRUE),
    bedroom_to_guest_ratio = mean(bedroom_to_guest_ratio, na.rm = TRUE),
    pct_cleaning_fee = mean(has_cleaning_fee, na.rm = TRUE),
    pct_verified_host = mean(is_verified_host, na.rm = TRUE),
    listing_count = n()
  ) %>%
  column_to_rownames("neighbourhood")

head(neighborhood_summary)
##                   avg_price avg_beds avg_reviews avg_review_score
## Annadale          106.00000 2.000000    49.00000         99.00000
## Astoria           100.91340 1.618557    31.97113         93.96289
## Battery Park City 186.42857 1.000000    13.28571         96.57143
## Bedford Park       60.83333 1.500000    25.00000         92.66667
## Bensonhurst        85.19231 2.038462    18.19231         92.26923
## Brooklyn Heights  171.59615 1.653846    21.21154         96.00000
##                   price_per_person bedroom_to_guest_ratio pct_cleaning_fee
## Annadale                  26.50000              0.2500000        0.0000000
## Astoria                   38.78780              0.5004849        0.7567010
## Battery Park City         93.21429              0.2142857        1.0000000
## Bedford Park              27.17361              0.4791667        0.9166667
## Bensonhurst               39.53388              0.5279762        0.7692308
## Brooklyn Heights          61.36635              0.3883814        0.7692308
##                   pct_verified_host listing_count
## Annadale                  0.0000000             1
## Astoria                   0.6515464           485
## Battery Park City         0.4285714             7
## Bedford Park              1.0000000            12
## Bensonhurst               0.4230769            26
## Brooklyn Heights          0.7307692            52
# Standardize features
neighborhood_scaled <- scale(neighborhood_summary)

# Remove NAs and zero-variance columns
neighborhood_scaled_clean <- as.data.frame(neighborhood_scaled) %>%
  na.omit()
neighborhood_scaled_clean <- neighborhood_scaled_clean[, apply(neighborhood_scaled_clean, 2, sd, na.rm = TRUE) != 0]
# Determine optimal number of clusters
library(factoextra)  # needed for fviz_nbclust() and fviz_cluster()

fviz_nbclust(neighborhood_scaled_clean, kmeans, method = "wss")
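The "wss" criterion plotted above is the total within-cluster sum of squares; computing it by hand shows what the elbow plot summarizes. A self-contained sketch using scaled mtcars as a stand-in for neighborhood_scaled_clean:

```r
dat <- scale(mtcars)  # stand-in for neighborhood_scaled_clean
set.seed(123)

# Total within-cluster sum of squares for k = 1..6.
wss <- sapply(1:6, function(k)
  kmeans(dat, centers = k, nstart = 25)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# wss always decreases as k grows; the "elbow" is where the drop levels off.
```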

# Perform K-means clustering
set.seed(123)
k <- 3
kmeans_result <- kmeans(neighborhood_scaled_clean, centers = k, nstart = 50)

# Assign cluster labels (ensure alignment)
neighborhood_summary_clean <- neighborhood_summary[rownames(neighborhood_scaled_clean), ]
neighborhood_summary_clean$cluster <- as.factor(kmeans_result$cluster)

# Visualize clusters using PCA projection of standardized data
fviz_cluster(kmeans_result, 
             data = neighborhood_scaled_clean, 
             geom = "point", repel = TRUE, labelsize = 8,
             main = "K-Means Clustering of NYC Neighborhoods",
             ellipse.type = "convex")

# Create a new column to join back neighborhood names
neighborhood_summary_named <- neighborhood_summary_clean %>%
  rownames_to_column("neighbourhood")

# Create cluster profile summary
cluster_profile <- neighborhood_summary_named %>%
  group_by(cluster) %>%
  summarise(
    avg_price = round(mean(avg_price), 2),
    avg_reviews = round(mean(avg_reviews), 1),
    avg_review_score = round(mean(avg_review_score), 1),
    bedroom_to_guest_ratio = round(mean(bedroom_to_guest_ratio), 2),
    pct_cleaning_fee = round(mean(pct_cleaning_fee), 2),
    avg_listings_per_neighborhood = round(mean(listing_count), 1),
    n_neighborhoods = n()
  )

# Print table
print(cluster_profile)
## # A tibble: 3 × 8
##   cluster avg_price avg_reviews avg_review_score bedroom_to_guest_ratio
##   <fct>       <dbl>       <dbl>            <dbl>                  <dbl>
## 1 1            87.2        25.3             93.4                   0.46
## 2 2           182.         24.8             94.9                   0.43
## 3 3            77.5        20.8             98.1                   0.61
## # ℹ 3 more variables: pct_cleaning_fee <dbl>,
## #   avg_listings_per_neighborhood <dbl>, n_neighborhoods <int>
# Show neighborhoods by cluster
cluster_composition <- neighborhood_summary_named %>%
  select(neighbourhood, cluster) %>%
  arrange(cluster, neighbourhood)

# View table
print(cluster_composition)
##                 neighbourhood cluster
## 1                Bedford Park       1
## 2                 Bensonhurst       1
## 3          Brooklyn Navy Yard       1
## 4                 Brownsville       1
## 5                    Canarsie       1
## 6                   Claremont       1
## 7  Columbia Street Waterfront       1
## 8                   Concourse       1
## 9                      Corona       1
## 10               Country Club       1
## 11         Ditmars / Steinway       1
## 12              Dyker Heights       1
## 13              East Elmhurst       1
## 14              East New York       1
## 15                  Flatlands       1
## 16               Forest Hills       1
## 17                  Gravesend       1
## 18            Jackson Heights       1
## 19           Kew Garden Hills       1
## 20                Kingsbridge       1
## 21        Kingsbridge Heights       1
## 22             Middle Village       1
## 23                    Midwood       1
## 24                Morris Park       1
## 25               New Brighton       1
## 26              Randall Manor       1
## 27                  Rego Park       1
## 28              Richmond Hill       1
## 29                  Riverdale       1
## 30                South Beach       1
## 31                  Stapleton       1
## 32              The Rockaways       1
## 33        Westchester Village       1
## 34                   Woodside       1
## 35                    Astoria       2
## 36          Battery Park City       2
## 37           Brooklyn Heights       2
## 38            Carroll Gardens       2
## 39               Civic Center       2
## 40               Clinton Hill       2
## 41                East Harlem       2
## 42              Hudson Square       2
## 43       Meatpacking District       2
## 44                    Midtown       2
## 45                Murray Hill       2
## 46                     Nolita       2
## 47                 Park Slope       2
## 48           Prospect Heights       2
## 49       South Street Seaport       2
## 50            Upper West Side       2
## 51               West Village       2
## 52                   Annadale       3
## 53                 Co-op City       3
## 54                Marine Park       3
## 55                 Mount Eden       3
## 56                   Van Nest       3

II. Show your code and results, and write two paragraphs describing what you included in your model, and what you found about the neighborhoods.

In our clustering model, we incorporated a variety of engineered features that describe the market characteristics of Airbnb listings in the 56 sampled New York City neighborhoods shown above. These included: the average price of listings (reverse-transformed from a log scale), the average number of reviews as a proxy for demand, the average review score as a measure of customer satisfaction, the ratio of bedrooms to guests (an efficiency metric), the percentage of listings that charge a cleaning fee, and the average number of listings per neighborhood to approximate market density. We selected these variables to balance pricing, guest experience, operational practices, and market activity.

The resulting k-means model revealed three distinct clusters. Cluster 1 consists of moderately priced neighborhoods with the highest number of reviews and listings, possibly reflecting high-demand, high-traffic areas with competitive but not top-tier pricing. Cluster 2 represents high-price, low-volume neighborhoods with fewer listings, lower review counts, and the highest cleaning fee incidence, indicating likely premium or luxury areas. Cluster 3 comprises the lowest-priced neighborhoods, with moderate to low reviews and a higher bedroom-to-guest ratio, suggesting more budget-oriented accommodations with more generous space configurations. These insights offer a nuanced segmentation of NYC’s short-term rental market and provide a foundation for future pricing or operational strategies.
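The clustering step itself is not shown above, but it can be sketched as follows: standardize the engineered features so no single variable dominates the distance metric, then run k-means with k = 3. The feature values below are synthetic stand-ins, not the actual neighborhood summary.

```r
# Sketch of the clustering step, assuming scaled features and k = 3.
# The data frame here is illustrative, not the real neighborhood summary.
set.seed(123)
toy_features <- data.frame(
  avg_price        = runif(56, 50, 250),
  avg_reviews      = runif(56, 10, 40),
  avg_review_score = runif(56, 85, 100)
)
scaled_features <- scale(toy_features)          # z-score each column
km <- kmeans(scaled_features, centers = 3, nstart = 25)
table(km$cluster)                               # cluster sizes
```

With `nstart = 25`, k-means restarts from 25 random initializations and keeps the best solution, which guards against poor local optima.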

Step V. Conclusions

I. Write a 3-5 paragraph summary that describes your overall process and experience with this assignment. How could these findings be useful? Who could benefit from the data mining that you have performed here?

In this project, we explored Airbnb listings in NYC through multiple lenses: regression, classification, and clustering. Our goal was to understand what drives price, how to predict it, and how to group different segments of the market.

For predicting log_price, the multiple regression model performed best. With an R-squared of 0.645 and an RMSE of 0.387, it explained about 65% of the price variation using meaningful predictors such as the number of bedrooms, bathrooms, review scores, and room type. This model offers strong, interpretable insights and is reliable for price estimation, especially in continuous form.
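The regression setup described above can be sketched as below. The column names and data are illustrative stand-ins for the actual listing data, so the fitted coefficients and fit statistics will not match the reported values.

```r
# Minimal sketch of a multiple regression on log price;
# data and column names are hypothetical, not the real listings.
set.seed(42)
toy_reg <- data.frame(
  bedrooms     = sample(1:4, 200, replace = TRUE),
  bathrooms    = sample(1:3, 200, replace = TRUE),
  review_score = runif(200, 80, 100)
)
toy_reg$log_price <- 3 + 0.3 * toy_reg$bedrooms + 0.2 * toy_reg$bathrooms +
  0.01 * toy_reg$review_score + rnorm(200, sd = 0.3)

fit <- lm(log_price ~ bedrooms + bathrooms + review_score, data = toy_reg)
summary(fit)$r.squared               # share of variance explained
sqrt(mean(residuals(fit)^2))         # in-sample RMSE on the log scale
```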

On the other hand, the Naive Bayes and decision tree models attempted to classify listings into price categories. While helpful for broad price segmentation, these models had modest accuracy (52%) and made simplifying assumptions that reduced their predictive power. Naive Bayes assumes independence among predictors, which isn’t always realistic, while the decision tree struggled with minority categories like “moderate” cancellation policies.
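A hedged sketch of the Naive Bayes step, using the `e1071` package loaded earlier: the price bands and predictors below are hypothetical, but the call pattern (`naiveBayes()` then `predict()`) mirrors the classification workflow described above.

```r
# Illustrative Naive Bayes classification into price bands;
# the data and the "price_band" rule are made up for the sketch.
library(e1071)
set.seed(7)
toy_nb <- data.frame(
  room_type = factor(sample(c("Entire home", "Private room"), 300, replace = TRUE)),
  bedrooms  = sample(1:3, 300, replace = TRUE)
)
toy_nb$price_band <- factor(
  ifelse(toy_nb$bedrooms + (toy_nb$room_type == "Entire home") > 2, "high", "low")
)

nb <- naiveBayes(price_band ~ room_type + bedrooms, data = toy_nb)
mean(predict(nb, toy_nb) == toy_nb$price_band)   # training accuracy
```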

The KNN model for cleaning fee classification achieved 70.74% accuracy but was heavily biased toward the majority class (“yes”). Its low Kappa score (0.17) and poor recall for “no” listings suggest it’s not suitable for this type of unbalanced classification problem.
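The accuracy-versus-Kappa gap described above is easy to reproduce: a classifier that always predicts the majority class can look accurate while learning nothing. This toy example (synthetic labels, not our cleaning fee data) shows `caret::confusionMatrix()` reporting high accuracy with a Kappa of zero.

```r
# Why accuracy misleads on imbalanced classes: an always-"yes"
# predictor scores 70% accuracy but a Kappa of 0 (no skill).
library(caret)
truth <- factor(c(rep("yes", 70), rep("no", 30)))
pred  <- factor(rep("yes", 100), levels = levels(truth))
cm <- confusionMatrix(pred, truth)
cm$overall[c("Accuracy", "Kappa")]
```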

Lastly, the K-means clustering analysis helped us uncover patterns across the 56 NYC neighborhoods in our sample. It grouped areas into three meaningful clusters: luxury, high-demand, and budget-friendly markets. While this method doesn’t predict individual prices, it adds valuable market-level context and supports strategic decisions, such as identifying target neighborhoods for expansion or adjusting service offerings.

If our goal is to analyze and predict listing prices (in log scale), the multiple linear regression model is the best choice. It offers both accuracy and clarity, helping users understand what drives price variations across NYC’s Airbnb market.